 entrance exam


Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO

Pappone, Francesco, Lazzaroni, Ruggero Marino, Califano, Federico, Gentile, Niccolò, Marras, Roberto

arXiv.org Artificial Intelligence

While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals like pedagogical soundness remains a significant challenge. Standard reinforcement learning techniques often rely on slow and expensive LLM-as-a-judge evaluations or on brittle, keyword-based metrics like ROUGE, which fail to capture the semantic essence of a high-quality explanation. In this work, we introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. This model provides a dense, semantically rich reward signal based on the cosine similarity between a generated explanation and a ground-truth reference, guiding the policy towards explanations that are not just factually correct but also structurally and conceptually aligned with expert reasoning. We apply this method to the task of training a model for the Italian medical-school entrance examinations, following standard domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT). Our results demonstrate that GRPO with our proposed semantic reward significantly improves explanation faithfulness and clarity over a strong SFT baseline, showcasing the power of using lightweight encoder models for nuanced reward shaping in complex generation tasks.
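
The reward described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for the encoder-only transformer, and the advantage step assumes the commonly described GRPO normalisation (each reward minus the group mean, divided by the group standard deviation). All function names are invented for illustration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an encoder-only transformer.

    Returns a sparse bag-of-words vector; a real setup would return a
    dense sentence embedding instead.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors (dict-like)."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_rewards(candidates, reference, embed=toy_embed):
    """Score each sampled explanation against the reference explanation."""
    ref_vec = embed(reference)
    return [cosine_similarity(embed(c), ref_vec) for c in candidates]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    reference = "the heart pumps blood through the arteries"
    group = [
        "the heart pumps blood through the arteries",
        "the heart pumps blood",
        "plants use sunlight",
    ]
    rewards = semantic_rewards(group, reference)
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the embedding function is passed in as a parameter, swapping the toy embedding for a real encoder would leave the reward and advantage code unchanged.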


BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning

Santos, João Guilherme Alves, Bonás, Giovana Kerche, Almeida, Thales Sales

arXiv.org Artificial Intelligence

With the growing capabilities of Large Language Models (LLMs), there is an increasing need for robust evaluation methods, especially in multilingual and non-English contexts. We present an updated version of the BLUEX dataset, now including 2024-2025 exams and automatically generated image captions using state-of-the-art models, enhancing its relevance for data contamination studies in LLM pretraining. Captioning strategies increase accessibility to text-only models by more than 40%, producing 1,422 usable questions, more than doubling the number in the original BLUEX. We evaluated commercial and open-source LLMs and their ability to leverage visual context through captions.


Evaluating GPT-4's Vision Capabilities on Brazilian University Admission Exams

Pires, Ramon, Almeida, Thales Sales, Abonizio, Hugo, Nogueira, Rodrigo

arXiv.org Artificial Intelligence

Recent advancements in language models have showcased human-comparable performance in academic entrance exams. However, existing studies often overlook questions that require the integration of visual comprehension, thus compromising the full spectrum and complexity inherent in real-world scenarios. To address this gap, we present a comprehensive framework to evaluate language models on entrance exams, which incorporates both textual and visual elements. We evaluate the two most recent editions of Exame Nacional do Ensino Médio (ENEM), the main standardized entrance examination adopted by Brazilian universities. Our study not only reaffirms the capabilities of GPT-4 as the state of the art for handling complex multidisciplinary questions, but also pioneers a realistic assessment of multimodal language models on Portuguese examinations. One of the highlights is that text captions transcribing visual content outperform the direct use of images, suggesting that the vision model has room for improvement. Yet, despite improvements afforded by images or captions, mathematical questions remain a challenge for these state-of-the-art models. The code and data used in the experiments are available at https://github.com/piresramon/gpt-4-enem.


ChatGPT is suddenly everywhere. Are we ready?

Engadget

For a product that its own creators, in a marketing pique, once declared "too dangerous" to release to the general public, OpenAI's ChatGPT is seemingly everywhere these days. The versatile automated text generation (ATG) system, which is capable of outputting copy that is nearly indistinguishable from a human writer's work, is officially still in beta but has already been utilized in dozens of novel applications, some of which extend far beyond the roles ChatGPT was originally intended for -- like that time it simulated an operational Linux shell or that other time when it passed the entrance exam to Wharton Business School. The hype around ChatGPT is understandably high, with myriad startups looking to license the technology for everything from conversing with historical figures to talking to historical literature, from learning other languages to generating exercise routines and restaurant reviews. But with these technical advancements comes a slew of opportunities for misuse and outright harm. And if our previous hamfisted attempts at handling the spread of deepfake video and audio technologies were any indication, we're dangerously underprepared for the havoc that at-scale, automated disinformation production will wreak upon our society.


On The Reasons Behind Decisions

Darwiche, Adnan, Hirth, Auguste

arXiv.org Artificial Intelligence

Recent work has shown that some common machine learning classifiers can be compiled into Boolean circuits that have the same input-output behavior. We present a theory for unveiling the reasons behind the decisions made by Boolean classifiers and study some of its theoretical and practical implications. We define notions such as sufficient, necessary and complete reasons behind decisions, in addition to classifier and decision bias. We show how these notions can be used to evaluate counterfactual statements such as "a decision will stick even if ... because ... ." We present efficient algorithms for computing these notions, which are based on new advances on tractable Boolean circuits, and illustrate them using a case study.
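
To make the notion of a sufficient reason concrete, here is a brute-force sketch. It is not the paper's method, which relies on tractable Boolean circuits; this version enumerates all feature subsets and is exponential in the number of features. It assumes a sufficient reason can be read as a minimal set of the instance's feature values that forces the same decision regardless of the remaining features. The `admit` classifier and its feature names are hypothetical.

```python
from itertools import combinations, product


def decision_is_fixed(classifier, instance, kept):
    """True if fixing the features in `kept` to their instance values
    forces the classifier to return the same decision as on `instance`."""
    target = classifier(instance)
    free = [name for name in instance if name not in kept]
    for values in product([False, True], repeat=len(free)):
        candidate = dict(instance)
        candidate.update(zip(free, values))
        if classifier(candidate) != target:
            return False
    return True


def sufficient_reasons(classifier, instance):
    """Enumerate minimal feature subsets that force the decision.

    Subsets are visited by increasing size, so any superset of an
    already-found reason can be skipped as non-minimal.
    """
    features = sorted(instance)
    reasons = []
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            kept = frozenset(subset)
            if any(reason <= kept for reason in reasons):
                continue
            if decision_is_fixed(classifier, instance, kept):
                reasons.append(kept)
    return reasons


def admit(applicant):
    """Hypothetical admissions rule used only for illustration."""
    return applicant["exam"] and (applicant["gpa"] or applicant["work"])


if __name__ == "__main__":
    applicant = {"exam": True, "gpa": True, "work": True}
    for reason in sufficient_reasons(admit, applicant):
        print(sorted(reason))
```

For this toy rule, each reason supports a counterfactual statement of the form quoted in the abstract: the decision will stick even if the features outside the reason change, because the features inside it already force the outcome.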


Schools tapping smartphone and tablet apps to engage a new generation

The Japan Times

Smartphone and tablet computer apps are seeing increasing use in Japanese schools as teachers look to capitalize on what has become many young people's preferred window to the world. Artificial intelligence-assisted apps have become prevalent in education, particularly in subjects many Japanese teachers struggle to teach well. One subject educators need help with is teaching English, a task that will become all the more important when speaking ability enters the joint achievement test in 2020, part of Japan's high-pressure university entrance exams. Nippon Sports Science University Kashiwa High School in Chiba Prefecture uses an app called TerraTalk to help students improve their English conversation skills. The school introduced the app last summer for use by students planning to study abroad.


Robots Behaving Badly

#artificialintelligence

Summary: For your holiday reading we present this selection of robot and AI fails. We hope this brings you hope and cheer for the coming year to know that our robot overlords are not as close as some think. Alas, it seems we've got a few more years before the robots take over.


Video Friday: Powered Exoskeleton, Drone Shows, and Soft Robotic Mask

IEEE Spectrum Robotics

Video Friday is your weekly selection of awesome robotics videos, collected by your Automaton bloggers. We'll also be posting a weekly calendar of upcoming robotics events for the next two months; here's what we have so far (send us your events!): Let us know if you have suggestions for next week, and enjoy today's videos. I don't know much about this powered partial exoskeleton called KOMA, except that the company behind it (ATOUN, from Japan) says that it's designed to help you carry very heavy objects in a way that won't interfere with your natural movements. Jiří Zemánek and Martin Gurtner from the Czech Technical University in Prague won first place in the IEEE CSS video contest (awarded at the IEEE CCTA 2017 conference) for their video demonstrating numerical optimal control on a "flying ball in a hoop" system: The IEEE CCTA Conference, incidentally, was held on the Kohala Coast in Hawaii, where as far as I know we have not had a major robotics conference recently.


Can a robot pass a university entrance exam?

#artificialintelligence

Meet Todai Robot, an AI project that performed in the top 20 percent of students on the entrance exam for the University of Tokyo -- without actually understanding a thing. While it's not matriculating anytime soon, Todai Robot's success raises alarming questions for the future of human education. How can we help kids excel at the things that humans will always do better than AI? Could an AI pass the entrance exam for the University of Tokyo? Noriko Arai oversees a project that wants to find out.


Robot created to ace Chinese math exams barely beats humans

#artificialintelligence

AI-MATHS is trying to get into college. Unbeknownst to students sitting for China's grueling college entrance exams, a robot was vying for a spot in one of the country's most prestigious universities. But it doesn't pose much threat yet. Called AI-MATHS, the robot completed one version of a two-hour Maths paper in 22 minutes and scored 105 points out of the maximum score of 150, reports the state-run Xinhua. AI-MATHS scored 100 points in another version of the paper which it completed in 10 minutes.